Peer Storage (Part 3): Identifying Lost Channel States #3897
base: main
Conversation
👋 Thanks for assigning @tnull as a reviewer!
Force-pushed from 101f31c to a35566a
`PeerStorageMonitorHolder` wraps a single `ChannelMonitor`; we add some fields separately so that we do not need to read the whole `ChannelMonitor` to identify whether we have lost some states. `PeerStorageMonitorHolderList` keeps the list of all the channels which would be sent over the wire inside peer storage.
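As a rough illustration, the wrappers might look something like this (field names and types are assumptions for the sketch, not the PR's exact definitions; `min_seen_secret` and `counterparty_node_id` are mentioned elsewhere in the diff):

```rust
// Illustrative sketch only: field names/types are assumptions, not the PR's
// exact definitions.
struct PeerStorageMonitorHolder {
    channel_id: [u8; 32],           // which channel this monitor belongs to
    min_seen_secret: u64,           // lets us detect lost state without a full read
    counterparty_node_id: [u8; 33], // who the channel is with
    monitor_bytes: Vec<u8>,         // the serialised ChannelMonitor itself
}

struct PeerStorageMonitorHolderList {
    monitors: Vec<PeerStorageMonitorHolder>,
}

fn main() {
    let list = PeerStorageMonitorHolderList { monitors: Vec::new() };
    assert!(list.monitors.is_empty());
}
```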
Create a utility function to prevent code duplication when writing `ChannelMonitor`s. Serialise them inside `ChainMonitor::send_peer_storage` and send them over. TODO: peer storage must not cross the 64k limit.
Deserialise the `ChannelMonitor`s and compare the data to determine whether we have lost some states.
The node should now determine lost states using the retrieved peer storage.
Force-pushed from a35566a to 4c9f3c3
Codecov Report
Attention: Patch coverage is …

@@ Coverage Diff @@
##            main    #3897      +/-   ##
==========================================
- Coverage   88.86%   88.86%   -0.01%
==========================================
  Files         165      165
  Lines      118886   118962      +76
==========================================
+ Hits       105650   105710      +60
- Misses      10911    10923      +12
- Partials     2325     2329       +4
///
/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
#[rustfmt::skip]
pub(crate) fn write_util<Signer: EcdsaChannelSigner, W: Writer>(channel_monitor: &ChannelMonitorImpl<Signer>, is_stub: bool, writer: &mut W) -> Result<(), Error> {
@wpaulino what do you think we should reasonably cut here to reduce the size of a `ChannelMonitor`, without making the emergency-case `ChannelMonitor`s all that different from the regular ones and inducing more code changes across channelmonitor.rs? Obviously we should avoid `counterparty_claimable_outpoints`, but how much code is gonna break in doing so?
Not too familiar with the goals here, but if the idea is for the emergency-case `ChannelMonitor` to be able to recover funds, wouldn't it need to handle a commitment confirmation from either party? That means we need to track most things, even `counterparty_claimable_outpoints` (without the sources, though), since the counterparty could broadcast a revoked commitment.
Basically. I think ideally we find a way to store everything (required) but `counterparty_claimable_outpoints`, so that we can punish the counterparty on their balance+reserve if they broadcast a stale state, even if not the HTLCs (though of course they can't claim the HTLCs without us being able to punish them on the next stage). Not sure how practical that is today without `counterparty_claimable_outpoints`, but I think that's the goal.

@adi2011 maybe for now let's just write the full monitors, but leave a TODO to strip out what we can later. For larger nodes that means all our monitors will be too large and we'll never back any up, but that's okay.
	}
},
None => {
	// TODO: Figure out if this channel is so old that we have forgotten about it.
There's no need to worry here, I think. If the channel is gone, we either haven't fallen behind (probably) or we already broadcasted a stale state (because we broadcast on startup if the channel is gone and we have a `ChannelMonitor`), at which point we're screwed. So nothing to do here.
Thanks for clarifying, I will remove this.
🔔 1st Reminder: Hey @tnull! This PR has been waiting for your review.
Took a first look, but will hold off on going into more detail until we've decided which way we should go with the `ChannelMonitor` stub.
},
Err(e) => {
	panic!("Wrong serialisation of PeerStorageMonitorHolderList: {}", e);
I don't think we should ever panic in any of this code. Yes, something might be wrong if we have peer storage data we can't read anymore, but really no reason to refuse to at least keep other potential channels operational.
Yes, that makes sense, I think we should only panic if we have determined that we have lost some channel state.
A few more comments; let's move forward without blocking on the `ChannelMonitor` serialization stuff.
lightning/src/ln/our_peer_storage.rs
Outdated
impl Writeable for PeerStorageMonitorHolderList {
	fn write<W: Writer>(&self, w: &mut W) -> Result<(), io::Error> {
		encode_tlv_stream!(w, { (1, &self.monitors, required_vec) });
		Ok(())
	}
}

impl Readable for PeerStorageMonitorHolderList {
	fn read<R: io::Read>(r: &mut R) -> Result<Self, DecodeError> {
		let mut monitors: Option<Vec<PeerStorageMonitorHolder>> = None;
		decode_tlv_stream!(r, { (1, monitors, optional_vec) });

		Ok(PeerStorageMonitorHolderList { monitors: monitors.ok_or(DecodeError::InvalidValue)? })
	}
}
You should be able to replace both of these with a single `impl_writeable_tlv_based` macro call.
Fixed, thanks for this!
lightning/src/chain/chainmonitor.rs
Outdated
let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();

// TODO(aditya): Choose n random channels so that peer storage does not exceed 64k.
This should be pretty easy? We have random bytes, just make an outer loop that selects a random monitor (by doing `monitors.iter().skip(random_usize % monitors.len()).next()`).
lightning/src/ln/channelmanager.rs
Outdated
@@ -8807,6 +8808,7 @@
	&self, peer_node_id: PublicKey, msg: msgs::PeerStorageRetrieval,
) -> Result<(), MsgHandleErrInternal> {
	// TODO: Check if we have any stale or missing ChannelMonitor.
	let per_peer_state = self.per_peer_state.read().unwrap();
No need to take the (read) lock at the top, do it after we decrypt.
Fixed!
Did a ~first pass.
This needs a rebase now, in particular now that #3922 landed.
@@ -810,10 +813,53 @@
	}

	fn send_peer_storage(&self, their_node_id: PublicKey) {
		// TODO: Serialize `ChannelMonitor`s inside `our_peer_storage`.

		static MAX_PEER_STORAGE_SIZE: usize = 65000;
This should be a `const` rather than a `static`, I think? Also, it would probably make sense to add this at the module level, with some docs.
Also isn't the max size 64 KiB, not 65K?
Oh, my bad, thanks for pointing this out. It should be 65531.
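For context, one plausible derivation of the 65531 figure (an assumption, not stated in the thread): a Lightning message payload is capped at 65535 bytes, and a peer storage message spends 2 bytes on the message type plus 2 bytes on the blob's length prefix:

```rust
// Hypothetical derivation of the 65531-byte limit; the breakdown into
// type/length overhead is an assumption based on the BOLT wire format.
const MAX_LN_MESSAGE_PAYLOAD: usize = 65535; // max Lightning message size
const MSG_TYPE_LEN: usize = 2; // 2-byte message type
const BLOB_LEN_PREFIX: usize = 2; // 2-byte length prefix on the storage blob
const MAX_PEER_STORAGE_SIZE: usize =
    MAX_LN_MESSAGE_PAYLOAD - MSG_TYPE_LEN - BLOB_LEN_PREFIX;

fn main() {
    assert_eq!(MAX_PEER_STORAGE_SIZE, 65531);
}
```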
let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();
let random_usize = usize::from_le_bytes(random_bytes[0..8].try_into().unwrap());
Depending on the platform, a `usize` might not always be 8 bytes. You'll probably need to do `const USIZE_LEN: usize = core::mem::size_of::<usize>();` and use that instead of `8`.
Fixed, thanks for pointing it out.
/// NOTE: `is_stub` is true only when we are using this to serialise for Peer Storage.
///
/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
#[rustfmt::skip]
Please don't add `rustfmt::skip` when introducing new code. It might be good to introduce a commit at the beginning that removes the `skip` from `fn write`, so that the code changes here are in fact mostly code moves.
let mut curr_size = 0;

// Randomising keys in the HashMap to fetch monitors without repetition.
let mut keys: Vec<&ChannelId> = monitors.keys().collect();
Can we make this a bit cleaner by using the proposed iterator-`skip`ping approach in the loop below, maybe while simply keeping track of which monitors we already wrote?
let min_seen_secret = mon.monitor.get_min_seen_secret();
let counterparty_node_id = mon.monitor.get_counterparty_node_id();

match write_util(&mon.monitor.inner.lock().unwrap(), true, &mut ser_chan) {
nit: Please move taking the lock out into a dedicated variable. This would also make it easier to spot the scoping of the lock, IMO.
Fixed!
match write_util(&mon.monitor.inner.lock().unwrap(), true, &mut ser_chan) {
	Ok(_) => {
		let mut ser_channel = Vec::new();
I think instead of creating a new `Vec` and then writing to it, you should be able to just call `encode` on the `PeerStorageMonitorHolder`.

But I'm currently confused about what we use `ser_channel` for to begin with: is it just to calculate the length below? That seems like a big unnecessary allocation. You could use `serialized_length`, for example, and keep track of the written bytes, comparing them to `MAX_PEER_STORAGE_SIZE`.
}

let mut serialised_channels = Vec::new();
monitors_list.write(&mut serialised_channels).unwrap();
Same here, just use `encode`.
monitors_list.monitors.push(peer_storage_monitor);
},
Err(_) => {
	panic!("Can not write monitor for {}", mon.monitor.channel_id())
Really, please avoid these explicit panics in any of this code.
///
/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
#[rustfmt::skip]
pub(crate) fn write_util<Signer: EcdsaChannelSigner, W: Writer>(channel_monitor: &ChannelMonitorImpl<Signer>, is_stub: bool, writer: &mut W) -> Result<(), Error> {
nit: We often use an `internal_` prefix or `_internal` suffix when splitting out functionality into util methods. This is no strict rule, but given the precedent it might be an easier-to-understand name.
let per_peer_state = self.per_peer_state.read().unwrap();

let mut cursor = io::Cursor::new(decrypted);
match <PeerStorageMonitorHolderList as Readable>::read(&mut cursor) {
nit: It might be cleaner to split the read call onto its own line using `Readable::read`. Given that we don't want to panic, we can probably also avoid the huge `match` by using `map_err` or `unwrap_or_else`.
In this PR, we begin serialising the `ChannelMonitor`s and sending them over the wire so that, upon retrieval, we can determine whether any channel states were lost.
The next PR will be the final one, where we use `FundRecoverer` to initiate a force close and potentially go on-chain with a penalty transaction.
Sorry for the delay!